1. INTRODUCTION

Human activity recognition (HAR) is a challenging subject because of the huge number of human
activities: some activities are easily noticed, some are confusing, and some require interaction with other
objects or humans. Besides the diversity of the activities, the recognition methods are also diverse. Many
types of data can be used to recognize a human activity. Some methods use ambient sensors such as
accelerometer, gyroscope, humidity, and temperature sensors [1], [2]. Some take advantage of smartphone
sensors such as the accelerometer and gyroscope [3], [4], and others use radio frequency [5]. The most
popular recognition methods, however, are vision-based [6]-[14].

Vision-based methods use images or videos to recognize the activity. Recognizing human activities
from visual data is challenging because of the effects of illumination and background variation. The
remaining question is how to process these visual data to recognize the activity. There are many techniques,
most of which use machine learning. Deep learning has proven highly beneficial for recognizing human
activities, especially convolutional neural networks (CNN), which are very useful for vision-based data.

In this paper, we design a neural network architecture (model) that can be used for human
activity monitoring. The proposed model is based on a three-dimensional CNN (3D-CNN). The purpose of
using a 3D-CNN is to extract both spatial and temporal features rather than only spatial features, because
an activity consists of multiple movements that can be identified by extracting the temporal features across
several frames.

Our proposed model uses a small number of 3D-CNN layers to reduce processing time, so that a
computer with low computational ability can recognize human activity online without delay.
CNN was introduced by Fukushima [12]. At the beginning, it was mainly used for image classification and
image feature extraction, and CNN-based models became some of the most attractive entries in image
classification challenges [13], [14]. During that period, many video datasets of human activity were
published, and KTH [15] is one of the most popular small datasets. The CNN architecture encouraged
researchers to use it in HAR with image data taken from video datasets [10], [16]. Other researchers
proposed a two-stream CNN in which the first stream is image data and the second is optical flow [16], [17].
Among the most cited articles on 3D-CNN for HAR, [18] proposed a model evaluated on the TRECVID
dataset, and [19] proposed a 3D-CNN model for the UCF-101 dataset [20], which inspired our model.


2. RESEARCH METHOD 

This paper proposes deep neural network (DNN) models that consist of many different layers, each
with its own purpose during training and testing. The dominant layer, on which this article focuses, is the
convolutional neural network (CNN) layer, one of the most widely used neural network types. Its central
idea is to apply filters to the input data (convolve it) and pass the convolved data to the next layer. CNN
was primarily used for image data. There are now 1D, 2D, and 3D CNNs, which extend this architecture to
data with a different number of dimensions. A 3D-CNN is used for three-dimensional data, which suits our
project because we are dealing with video data. The reason for using video rather than single images is that
an activity is made of several consecutive movements of body parts. This continuous movement can be
observed only across successive images, that is, in a video.

The pooling layers used in this paper are max-pooling, which returns the maximum value within a
kernel-sized window as it slides over the data; average-pooling, which returns the average value within the
window; and global-average-pooling, which returns the average value of each channel (each kernel output)
of the CNN layer, which is why it is useful at the final part of the model to reduce the number of
parameters. Dropout layers are used to help the model overcome overfitting.
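The three pooling operations described above can be illustrated with a minimal sketch in plain Python. The helper names (`pool3d`, `mean`, `global_average`) and the tiny 2*2*2 volume are hypothetical and chosen only for illustration; the sketch assumes non-overlapping windows on a single channel and is not the implementation used for the experiments.

```python
from itertools import product

def pool3d(volume, size, reduce_fn):
    """Apply non-overlapping pooling to a 3-D nested list (depth, height, width)."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    pooled = []
    for d0 in range(0, depth - size + 1, size):
        plane = []
        for h0 in range(0, height - size + 1, size):
            row = []
            for w0 in range(0, width - size + 1, size):
                # Collect every value inside the current size*size*size window.
                window = [volume[d0 + d][h0 + h][w0 + w]
                          for d, h, w in product(range(size), repeat=3)]
                row.append(reduce_fn(window))
            plane.append(row)
        pooled.append(plane)
    return pooled

def mean(values):
    return sum(values) / len(values)

def global_average(volume):
    """Collapse one whole channel volume to a single average value."""
    flat = [v for plane in volume for row in plane for v in row]
    return mean(flat)

# Hypothetical 2x2x2 single-channel volume holding the values 1..8.
volume = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
max_pooled = pool3d(volume, 2, max)    # one window, so one maximum
avg_pooled = pool3d(volume, 2, mean)   # one window, so one average
channel_mean = global_average(volume)  # one value per channel
```

Global average pooling always yields one value per channel regardless of the input size, which is why it keeps the following fully connected layer small.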


2.1. Proposed models 

The first suggested model, shown in Figure 1, is influenced by the model proposed by Tran et al.
[19]. The model consists of three consecutive 3D convolutional layers with a 3*3*3 kernel size, each
followed by a max-pooling layer with a kernel size of 2*2*2. Zero-padding is used for all convolutional
layers, and the ReLU activation function is applied after each convolutional and fully connected (FC)
layer except the last, which uses SoftMax.


Figure 1. Initial proposed model (input of 30*40*40*1, convolutional and max-pooling stages, output of 6 classes)
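As a rough check on the architecture described above, the following sketch traces the feature-map size through three convolution and max-pooling stages. It assumes that zero padding keeps the convolution output at ceil(n/stride), as the Keras 'same' padding mode does, and that max-pooling is unpadded with a stride equal to the pool size, with odd sizes rounded down. The function names are hypothetical, and the number of filters per layer is left out because the text does not state it for every layer.

```python
import math

def same_conv_shape(shape, strides=(1, 1, 1)):
    """Output (frames, height, width) of a zero-padded ('same') convolution."""
    return tuple(math.ceil(n / s) for n, s in zip(shape, strides))

def max_pool_shape(shape, pool=(2, 2, 2)):
    """Output shape of unpadded max-pooling whose stride equals the pool size."""
    return tuple((n - p) // p + 1 for n, p in zip(shape, pool))

def trace_shapes(input_shape, stages=3):
    """Follow the feature-map size through `stages` conv + max-pool pairs."""
    shapes = [input_shape]
    for _ in range(stages):
        shapes.append(max_pool_shape(same_conv_shape(shapes[-1])))
    return shapes

# Input of 30 frames of 40x40 pixels, as in Figure 1.
shapes = trace_shapes((30, 40, 40))
# shapes == [(30, 40, 40), (15, 20, 20), (7, 10, 10), (3, 5, 5)]
```

Under these assumptions the first stage gives 15, 20, 20, which agrees with the MaxPooling3D output shape listed in the model architecture figures.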


In the second attempt, another convolutional layer and max-pooling layer were added before the
flatten layer to increase accuracy. Dropouts were then added at different places in the model over five
attempts to obtain high accuracy; all dropouts used a rate of 0.5. To reduce the number of parameters
(weights and biases), we returned to the first model and added dropouts. Accuracy increased with the
dropouts, but the number of parameters was huge: with the added convolutional and pooling layers
removed, the output of the layer before the flatten layer was larger, so the number of parameters
increased. Equations (1)-(3) show how the number of parameters is calculated. To solve this problem, a
GlobalAveragePooling3D layer replaced the flatten layer. This replacement reduced the number of
parameters to about two million, which is very helpful for machines with low or moderate computational
capability performing online activity recognition.


No. of parameters in CNN = (filter_width * filter_height * filter_depth + 1) * no. of filters (1)
No. of parameters in FC net = neurons of current layer * neurons of previous layer (2)
No. of parameters in flatten = multiplication of all previous layer dimensions (3)
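Equations (1)-(3) can be evaluated with a short sketch such as the one below. The function names are hypothetical. The `in_channels` and `with_bias` arguments are additions that do not appear in the equations: a convolution over more than one input channel multiplies the kernel volume by the channel count, and a fully connected layer normally also has one bias per neuron. With their default values the functions reduce to (1) and (2) as written.

```python
def conv3d_params(kernel_d, kernel_h, kernel_w, filters, in_channels=1):
    """Equation (1); in_channels generalises it beyond a single input channel."""
    return (kernel_d * kernel_h * kernel_w * in_channels + 1) * filters

def dense_params(prev_neurons, cur_neurons, with_bias=False):
    """Equation (2); with_bias=True also counts one bias per current neuron."""
    return prev_neurons * cur_neurons + (cur_neurons if with_bias else 0)

def flatten_size(shape):
    """Equation (3): product of all dimensions of the previous layer output."""
    size = 1
    for dim in shape:
        size *= dim
    return size

# First convolutional layer: 3*3*3 kernels, 64 filters, grayscale input.
first_conv = conv3d_params(3, 3, 3, filters=64)

# Hypothetical example: a (3, 5, 5, 64) output flattened into a 10-neuron layer.
flat = flatten_size((3, 5, 5, 64))
fc_weights = dense_params(flat, 10)
```

For the first layer this gives (27 + 1) * 64 = 1792, the parameter count listed for Conv3D in the model architecture figures.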


3. RESULTS AND DISCUSSION 

The neural network model is trained on the KTH dataset. The dataset is doubled before use by
adding a flipped copy of it. Of the data, 60% is used for training, 20% for validation during training, and
20% for testing after training. The model was trained using TensorFlow v2.1 [21] as the backend and Keras
v2.3.1 [22] with Python v3.7.5 as the front end, with a batch size of 16. Each frame taken from a video
is 40x40 pixels with one channel (grayscale). From each video, 30 frames were taken, and the four frames
between each pair of consecutive selected frames were discarded.
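The frame selection and data split described above can be sketched as follows. The function names, the starting index of 0, and the example total of 1000 samples are assumptions made for illustration; the text does not state where sampling starts within a video or how many samples the doubled dataset contains.

```python
def sampled_frame_indices(num_frames=30, skipped=4, start=0):
    """Indices of the frames kept when `skipped` frames are dropped between picks."""
    step = skipped + 1
    return [start + i * step for i in range(num_frames)]

def split_sizes(total, test_fraction=0.2, validation_fraction=0.2):
    """Sizes of the training, validation, and test subsets."""
    n_test = round(total * test_fraction)
    n_val = round(total * validation_fraction)
    return total - n_test - n_val, n_val, n_test

indices = sampled_frame_indices()
# Every fifth frame is kept: 0, 5, 10, ... up to index 145.

train_size, val_size, test_size = split_sizes(1000)  # hypothetical total
```

Under these assumptions a video needs at least 146 frames to supply the 30 selected frames.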

The model was optimized with the Adam optimizer [23], a learning rate of 0.001, and the
categorical cross-entropy loss function. The machine specification was: HP 15 Notebook; memory: 12288 MB
RAM; Intel Core i5-3230M processor with four cores, two of them physical, and a maximum frequency of
2.6 GHz; Windows 8.1 Enterprise 64-bit OS (6.3, build 9600).

Validation data controls the training operation. After the model is updated, the validation data is
applied to it; if the validation loss has not improved for three epochs, the learning rate is multiplied by
one half, down to a minimum of 0.0001. If the loss has not improved for 15 consecutive epochs, training
stops before reaching the given number of epochs, which is 100.
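The training control described above can be sketched as a small simulation over a list of validation losses. The function name and the exact bookkeeping (when the counters reset, and that the stop check precedes the rate reduction) are assumptions; in Keras, similar behaviour is commonly obtained with the `ReduceLROnPlateau` and `EarlyStopping` callbacks, although the text does not name the mechanism that was used.

```python
def simulate_training_control(val_losses, lr=0.001, factor=0.5, min_lr=0.0001,
                              lr_patience=3, stop_patience=15):
    """Replay validation losses; return the learning rate per epoch and the stop epoch."""
    best = float("inf")
    since_best = 0        # epochs since the last improvement (early stopping)
    since_lr_change = 0   # epochs without improvement since the last reduction
    history = []
    for epoch, loss in enumerate(val_losses, start=1):
        history.append(lr)  # learning rate in effect during this epoch
        if loss < best:
            best = loss
            since_best = 0
            since_lr_change = 0
        else:
            since_best += 1
            since_lr_change += 1
        if since_best >= stop_patience:
            return history, epoch  # stop early
        if since_lr_change >= lr_patience:
            lr = max(lr * factor, min_lr)  # halve, but never below min_lr
            since_lr_change = 0
    return history, len(val_losses)

# Hypothetical loss curve: two improving epochs, then a plateau.
history, stopped_at = simulate_training_control([1.0, 0.9] + [0.95] * 20)
```

With this hypothetical curve the learning rate is halved every three stalled epochs until it reaches 0.0001, and training stops 15 epochs after the last improvement.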


3.1. Calculating results 

After training finishes, the test samples are passed to the model to obtain its responses. The
accuracy, precision, recall, and F1_score are calculated using (4)-(7), which use the confusion matrix
shown in Figure 2 [24]. The average loss is calculated using the categorical cross-entropy [25].





























Data class | Classified as pos   | Classified as neg
pos        | true positive (tp)  | false negative (fn)
neg        | false positive (fp) | true negative (tn)
Figure 2. Confusion matrix annotation
Equation (4) shows how to calculate accuracy, which measures the overall effectiveness of the model.

Precision = (sum over classes i of tp_i / (tp_i + fp_i)) / no. of classes (5)


Equation (5) shows how to calculate precision, which measures the agreement between the class
labels and the labels calculated by the model. Equation (6) shows how to calculate recall, which measures
the effectiveness of the model in identifying class labels. Equation (7) shows how to calculate the
F1_score, which relates the positive labels of the test data to those given by the model.
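The measures described above can be computed from a multi-class confusion matrix with a sketch like the following. It assumes the standard definitions: accuracy as the fraction of correctly classified samples, precision and recall averaged over classes, and F1 computed from the averaged precision and recall. That last choice is consistent with the rows of Table 1, for example 2 * 86.30 * 85.89 / (86.30 + 85.89) ≈ 86.09. Whether (4)-(7) match these definitions term by term is an assumption, and the function name and example matrix are hypothetical.

```python
def macro_metrics(confusion):
    """Accuracy and class-averaged precision, recall, and F1 from a confusion matrix.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))
    precisions, recalls = [], []
    for i in range(n_classes):
        tp = confusion[i][i]
        fp = sum(confusion[r][i] for r in range(n_classes)) - tp  # column minus tp
        fn = sum(confusion[i]) - tp                               # row minus tp
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    precision = sum(precisions) / n_classes
    recall = sum(recalls) / n_classes
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / total, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical two-class example with 20 test samples.
scores = macro_metrics([[8, 2], [1, 9]])
```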

Equation (8) shows how to calculate the average loss, where N is the number of samples, M is the
number of classes, d is the true label or desired output, and y is the output calculated by the model.
Table 1 shows the calculated results and the corresponding figure numbers. Table 2 compares the accuracy
of several studies on the KTH dataset for human activity recognition with the accuracy of our study; our
method achieves the highest accuracy among them.
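The average loss can be illustrated with the standard form of categorical cross-entropy, loss = -(1/N) * sum over samples n and classes m of d_nm * log(y_nm), using the symbols N, M, d, and y defined above. That (8) has exactly this form is an assumption; the function name, the `eps` guard against log(0), and the example values are hypothetical.

```python
import math

def average_categorical_cross_entropy(desired, predicted, eps=1e-12):
    """Mean over N samples of -sum over M classes of d * log(y)."""
    n_samples = len(desired)
    total = 0.0
    for d_row, y_row in zip(desired, predicted):
        # Only the true class (d = 1) contributes for one-hot labels.
        total -= sum(d * math.log(max(y, eps)) for d, y in zip(d_row, y_row))
    return total / n_samples

# Two samples, three classes; the true class receives probability 0.5 each time.
loss = average_categorical_cross_entropy(
    [[1, 0, 0], [0, 1, 0]],
    [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]],
)
```

In this example the loss equals -log(0.5), about 0.693, because each sample assigns probability 0.5 to its true class.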
Table 1. Calculated results for all models
No.  Figure     Accuracy %  Loss  Precision %  Recall %  F1_score %
1.   Figure 3   85.83       0.39  86.30        85.89     86.09
2.   Figure 4   88.75       0.39  88.97        88.97     88.97
3.   Figure 5   92.08       0.23  92.67        91.90     92.28
4.   Figure 6   92.92       0.22  93.08        92.95     93.02


Table 2. Comparison of accuracies of studies on the KTH dataset


No. Method Accuracy % 
1 Ahmad and Lee [26] 84.83 
2 Taylor et al. [27] 88.00 
3 Qian et al. [28] 88.69 
4 Our method 93.33 




3.2. Calculating the number of operations for layers 
We want to calculate an approximate number of operations for each layer. The calculations do not
include control operations and are detailed below:
a. 3D-CNN: each kernel is convolved over the entire input data. For a 3D-CNN with a kernel size of
   (Kd, Kh, Kw), input data of (frames, height, width), and strides of (stride_d, stride_h, stride_w),
   we have:

No. of operations per node = (Kd * Kh * Kw + 1)^2 * no. of previous kernels (9)

Output nodes = ((frames - Kd)/stride_d + 1) * ((height - Kh)/stride_h + 1) *
               ((width - Kw)/stride_w + 1) * no. of current kernels (10)

Output nodes = (frames/stride_d + 1) * (height/stride_h + 1) * (width/stride_w + 1) *
               no. of current kernels (11)

Equation (9) gives the number of operations for each output node of the convolution; the power of
two is used because of the multiplications. Equation (10) [29] gives the number of output nodes of a
3D-CNN layer without zero padding. If zero padding is used, as in our proposed model, the number of
output nodes is as shown in (11) [29].

b. MaxPooling3D performs comparison operations. For a pool window size of (Pd, Ph, Pw), input data of
   (frames, height, width), and strides of (stride_d, stride_h, stride_w), we have:

No. of operations per node = Pd * Ph * Pw (12)

Output nodes = ((frames - Pd)/stride_d + 1) * ((height - Ph)/stride_h + 1) *
               ((width - Pw)/stride_w + 1) * no. of current kernels (13)

Output nodes = (frames/stride_d + 1) * (height/stride_h + 1) * (width/stride_w + 1) *
               no. of current kernels (14)

Equation (12) gives the number of operations for each pool window. Equation (13) gives the number of
output nodes without padding, which is the case in our model, where the stride size equals the pool
window size. Equation (14) gives the number of output nodes if zero padding is used.
c. A fully connected layer has a vast number of parameters compared with a CNN layer. If the number of
   neurons of the previous layer is N_previous and the number of neurons of the current layer is
   N_current, the number of operations is given by (15); the power of two is used because of the
   multiplications.

Number of operations = (N_previous * N_current)^2 (15)

d. Flatten has only one operation which is reshaping the dimensions into one dimension. 

e. Dropout works only during training: it randomly hides a portion of the nodes so that they do not
   participate in producing the output at a given training step. For testing, it costs no operations.

f. GlobalAveragePooling3D produces one node for each channel. Where the previous layer output is
   (frames, height, width, channels), it sums all the values over frames, height, and width for a
   particular channel and divides by (frames * height * width), so the total number of operations is as
   shown in (16).

Number of operations=(frames * height * width) * channels (16) 
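The pooling-related counts in the list above, (12), (13), and (16), can be evaluated with the sketch below. The function names are hypothetical, integer division is assumed wherever the equations divide by a stride, and the (3, 5, 5) feature map with 64 channels passed to the global average pooling count is an illustrative value rather than a figure taken from the results.

```python
def unpadded_output_nodes(shape, window, strides, kernels):
    """Equations (10) and (13): output nodes without zero padding."""
    nodes = kernels
    for n, k, s in zip(shape, window, strides):
        nodes *= (n - k) // s + 1
    return nodes

def max_pool_operations(shape, pool, kernels):
    """Equation (12) per node times the node count from (13), stride = pool size."""
    per_node = pool[0] * pool[1] * pool[2]
    return per_node * unpadded_output_nodes(shape, pool, pool, kernels)

def global_average_pool_operations(shape, channels):
    """Equation (16): one accumulation per input value of every channel."""
    frames, height, width = shape
    return frames * height * width * channels

# First max-pooling layer: 30x40x40 input with 64 channels, 2*2*2 windows.
pool_nodes = unpadded_output_nodes((30, 40, 40), (2, 2, 2), (2, 2, 2), 64)
pool_ops = max_pool_operations((30, 40, 40), (2, 2, 2), 64)

# Hypothetical final feature map fed to GlobalAveragePooling3D.
gap_ops = global_average_pool_operations((3, 5, 5), 64)
```

For the first pooling layer this gives 15 * 20 * 20 * 64 = 384000 output nodes, in line with the 15, 20, 20, 64 output shape listed for MaxPooling3D.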

Table 3 shows the number of operations for each proposed model according to its figure. The model
with the fewest operations is also the one with the fewest parameters, and it has a high accuracy of
92.92%.


Table 3. Number of operations for each model according to its figure 
No. of figure No. of operations 


Figures 3 1.08 * 10” 
Figures 4, 5 5.77 * 10” 
Figure 6 4.77 * 10" 


4. DISCUSSION 

Each result is discussed in order of its figure number.

Figure 4 shows the model architecture, confusion matrix, accuracy, and loss curves for the earlier
proposed model. Because the output of the last layer before the flatten layer was not small, the fully
connected layer had a massive number of parameters. The model's learning also saturated early, and
training completed within 30 epochs. Figure 5 shows the model architecture, confusion matrix, accuracy,
and loss curves for the proposed model after adding Conv3D and MaxPooling layers before the flatten layer
to reduce the number of parameters and the training time; training finished within 30 epochs. Figure 6
shows the model architecture, confusion matrix, accuracy, and loss curves for the model after adding
dropouts after the fourth, sixth, and eighth layers. The accuracy of this model showed a remarkable
improvement, and training finished at epoch 100, which is the maximum number of epochs. Figure 6 shows
the model architecture, confusion matrix, accuracy, and loss curves for the model after replacing the
flatten layer with GlobalAveragePooling3D, which reduced the number of parameters and therefore the
training time per epoch. It also reduced the number of operations during testing and reached an accuracy
of 92.92%; training finished at epoch 82.
Figure 3. Earlier proposed model; (a) model architecture, (b) confusion matrix


Development of 3D convolutional neural network to recognize human ... (Malik A. Alsaedi) 


ISSN: 2302-9285 





Figure 4. Earlier proposed model; (c) accuracy of training and validation versus epochs, (d) losses of training and validation versus epochs
Layer (type)    Output shape      Parameters
Conv3D          30, 40, 40, 64    1792
MaxPooling3D    15, 20, 20, 64    0
Figure 5. Adding fourth convolutional and max-pooling layers; (a) model architecture, (b) confusion matrix,
(c) accuracy of training and validation, (d) losses of training and validation
Figure 6. Dropouts before third and fourth convolutional and also before flatten layer; (a) model architecture,
(b) confusion matrix, (c) accuracy of training and validation, (d) losses of training and validation
Figure 6. Replacing flatten with global average pooling; (a) model architecture, (b) confusion matrix
Figure 6. Replacing flatten with global average pooling; (c) accuracy of training and validation, (d) losses of
training and validation (continued)


Dropout is of great benefit, but only when it is placed correctly. The place and number of
dropouts were changed several times. Accuracy increased when dropout was added before the third
convolutional layer and before the flatten layer, but decreased slightly when it was added before the
fourth convolutional layer. The dropouts were then moved to before and after the flatten layer, which gave
the maximum accuracy. This improvement was then tested on the model with fewer layers, with the aim of
decreasing the number of parameters, and the results were helpful. To complete the parameter reduction,
the flatten layer was replaced by global-average-pooling, which reduced the number of parameters of the
model by two million.


5. CONCLUSION 

We have designed a model that can be used for online human activity recognition on a machine with
moderate computational capability. The accuracy of our proposed model reached 93.33%, and 92.92% for the
model with a reduced number of parameters. The last presented model is useful for machines with moderate
computational capabilities because of its low number of parameters and low number of mathematical
operations. We reached this accuracy by taking advantage of dropouts and by decreasing the learning rate
during training when there was no improvement. The model with a low number of mathematical operations
could be used for online human activity recognition in smart houses, helping to monitor human activities
there. Because only flipping augmentation was applied to the data, we intend to apply more data
augmentation to increase the overall accuracy.